@tanishagoyal2 tanishagoyal2 commented Dec 23, 2025

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • CSP health monitor adds a configurable processing strategy via Helm and a runtime flag: EXECUTE_REMEDIATION (default) and STORE_ONLY. The chosen strategy is included in emitted health events and respected at runtime.
  • Tests

    • Added integration tests for STORE_ONLY semantics and updated event-exporter validations to assert processingStrategy.
    • New test helpers for rollout checks and updating/removing container deployment arguments.

coderabbitai bot commented Dec 23, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between 0fdaae2 and 400aab2.

📒 Files selected for processing (4)
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/csp_health_monitor_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
📝 Walkthrough

Adds a configurable processing strategy (EXECUTE_REMEDIATION or STORE_ONLY): a Helm value and deployment flag, CLI parsing and validation, threading of the strategy through the trigger engine so that each emitted pb.HealthEvent includes it, and tests and helpers that verify STORE_ONLY behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Config & Deployment: distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml, distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml | Adds --processing-strategy={{ .Values.processingStrategy }} to the container and introduces processingStrategy Helm value (default EXECUTE_REMEDIATION, option STORE_ONLY) with documentation. |
| CLI / Main: health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go | Adds --processing-strategy CLI flag, validates it against protobuf enum, logs the choice, and passes validated pb.ProcessingStrategy into engine construction. |
| Trigger Engine: health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go | Adds processingStrategy pb.ProcessingStrategy field to Engine, updates NewEngine(...) signature to accept the strategy, assigns it, and sets ProcessingStrategy on constructed pb.HealthEvent. |
| Trigger Engine Tests: health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go | Test constructor calls updated to pass pb.ProcessingStrategy_EXECUTE_REMEDIATION; test behavior unchanged. |
| Integration Tests: tests/csp_health_monitor_test.go | Adds TestCSPHealthMonitorStoreOnlyProcessingStrategy to assert STORE_ONLY semantics (event stored/exported, node not cordoned, changestream indicates STORE_ONLY). (Test was added twice in diff.) |
| Event Exporter Tests & Helpers: tests/event_exporter_test.go, tests/helpers/event_exporter.go | ValidateCloudEvent extended to accept expected processing strategy and assert healthEvent["processingStrategy"]; added FindEventByNodeAndCheckName helper. |
| Kubernetes Test Utilities: tests/helpers/kube.go | Adds WaitForDaemonSetRollout, SetDeploymentArgs, RemoveDeploymentArgs, and internal helpers (setArgsOnContainer, removeArgsFromContainer) with retry-on-conflict logic to modify container args for test orchestration. |

Sequence Diagram

sequenceDiagram
    participant Helm as Helm values
    participant Deployment as Deployment YAML
    participant CLI as maintenance-notifier (main)
    participant Engine as Trigger Engine
    participant Store as Datastore
    participant Exporter as Event Exporter

    Helm->>Deployment: supply processingStrategy value
    Deployment->>CLI: container started with --processing-strategy
    CLI->>CLI: validate value against protobuf enum
    CLI->>Engine: NewEngine(..., processingStrategy)
    Engine->>Engine: store processingStrategy on Engine
    Engine->>Store: persist HealthEvent (includes processingStrategy)
    Engine->>Exporter: emit HealthEvent/CloudEvent (includes processingStrategy)
    Exporter->>Exporter: export semantics reflect processingStrategy

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 A flag from Helm to the run, so spry,
EXECUTE or STORE - events choose why.
Engine carries the hop, each HealthEvent sings,
Stored or acted, depending on wings.
Tests roll and check the strategy's light.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding event handling strategy configuration to csp-health-monitor, making the processingStrategy feature configurable across deployment, values, and engine implementation. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/helpers/kube.go (1)

2338-2393: Minor: Docstring mentions environment variables instead of arguments.

The function docstring says "removes environment variables" but this function removes container arguments.

Suggested fix for docstring
-// RemoveDeploymentArgs removes environment variables from containers in a deployment.
+// RemoveDeploymentArgs removes arguments from containers in a deployment.
 // If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
 // Uses retry.RetryOnConflict for automatic retry handling.
tests/csp_health_monitor_test.go (1)

522-535: Consider using the return value of TeardownCSPHealthMonitorTest.

The teardown correctly restores the processing strategy, but TeardownCSPHealthMonitorTest returns a context that is discarded. Other tests in this file return the result of this call. While this may be intentional, it should be consistent with other teardown patterns.

-		helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx)
-
-		return ctx
+		return helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82e7180 and 61f34e7.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (23)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/csp_health_monitor_test.go
  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/healthevent.go
  • store-client/pkg/client/pipeline_builder_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • tests/event_exporter_test.go
  • store-client/pkg/client/pipeline_builder.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/helpers/kube.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • event-exporter/pkg/transformer/cloudevents.go
  • health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
  • tests/csp_health_monitor_test.go
  • fault-quarantine/pkg/initializer/init.go
  • tests/helpers/event_exporter.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • store-client/pkg/client/pipeline_builder_test.go
  • tests/event_exporter_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/csp_health_monitor_test.go
data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (7)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • store-client/pkg/client/pipeline_builder_test.go
  • tests/event_exporter_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/csp_health_monitor_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/event_exporter_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/event_exporter_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • tests/event_exporter_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation

Applied to files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
🧬 Code graph analysis (10)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
  • ValidateCloudEvent (257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (19)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • Entity (208-214)
  • Entity (227-227)
  • Entity (242-244)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
  • HealthEvents (156-162)
  • HealthEvents (175-175)
  • HealthEvents (190-192)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
  • NewK8sConnector (47-58)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
  • NewEngine (70-86)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
data-models/pkg/protos/health_event_grpc.pb.go (1)
  • PlatformConnectorClient (43-45)
data-models/pkg/protos/health_event.pb.go (8)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (56-59)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
  • NewEngine (70-86)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
🔇 Additional comments (31)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM!

The addition of processingStrategy with float64(0) correctly reflects the default protobuf enum value (EXECUTE_REMEDIATION = 0) after a JSON round trip: Go's encoding/json decodes JSON numbers into float64 when the target is an interface{} value.
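
To illustrate why the expected value in that test is float64(0) rather than an int, here is a minimal, self-contained sketch. The helper name decodeStrategy and the one-field JSON document are illustrative assumptions, not code from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeStrategy unmarshals a JSON document into a generic map and returns
// the raw value stored under "processingStrategy". (Hypothetical helper.)
func decodeStrategy(doc string) (any, error) {
	var event map[string]any
	if err := json.Unmarshal([]byte(doc), &event); err != nil {
		return nil, fmt.Errorf("decode event: %w", err)
	}
	return event["processingStrategy"], nil
}

func main() {
	v, err := decodeStrategy(`{"processingStrategy": 0}`)
	if err != nil {
		panic(err)
	}
	// encoding/json stores JSON numbers as float64 in interface values,
	// so the dynamic type here is float64, not int.
	fmt.Printf("%T %v\n", v, v)
	// Comparing against an int-typed 0 is false because the dynamic types differ.
	fmt.Println(v == float64(0), v == 0)
}
```

An assertion written against int(0) would therefore fail even though the numeric value matches, which is why the test fixture uses float64(0).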

tests/helpers/healthevent.go (2)

48-48: LGTM!

The new ProcessingStrategy field follows the same pattern as RecommendedAction (also an int with omitempty), maintaining consistency in the test helper struct.


153-156: LGTM!

The fluent setter follows the established builder pattern used throughout this file.

distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml (1)

52-57: Well documented configuration option.

The inline comments clearly explain the valid values and their behavioral implications, following the Helm chart documentation guidelines. The default EXECUTE_REMEDIATION maintains backward compatibility.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (3)

325-343: LGTM - Clean filtering implementation.

The filterProcessableEvents function correctly filters out STORE_ONLY events with appropriate logging for observability. The function is well-structured and follows Go conventions.
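
As a rough sketch of the filtering idea described above (not the PR's actual implementation), the following uses local stand-in types in place of the generated pb package. The names ProcessingStrategy, HealthEvent, and filterProcessable are hypothetical, and the real function also logs skipped events.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int

const (
	ExecuteRemediation ProcessingStrategy = iota
	StoreOnly
)

// HealthEvent is a trimmed, hypothetical stand-in for pb.HealthEvent.
type HealthEvent struct {
	NodeName string
	Strategy ProcessingStrategy
}

// filterProcessable returns the events that should drive node conditions and
// Kubernetes events. STORE_ONLY events (and nil entries) are dropped.
func filterProcessable(events []*HealthEvent) []*HealthEvent {
	processable := make([]*HealthEvent, 0, len(events))
	for _, ev := range events {
		if ev == nil || ev.Strategy == StoreOnly {
			continue
		}
		processable = append(processable, ev)
	}
	return processable
}

func main() {
	events := []*HealthEvent{
		{NodeName: "node-a", Strategy: ExecuteRemediation},
		{NodeName: "node-b", Strategy: StoreOnly},
	}
	for _, ev := range filterProcessable(events) {
		fmt.Println(ev.NodeName)
	}
}
```

Building a new slice rather than filtering in place keeps the caller's batch intact, which matters if the unfiltered events are still needed for storage.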


345-370: Good refactor - centralizes K8s event creation.

The createK8sEvent helper consolidates event construction logic that was previously inline, improving maintainability and reducing code duplication.


372-416: LGTM - Correctly updated to use filtered events.

The processHealthEvents function properly filters events before processing and uses the filtered slice (processableEvents) consistently for both node condition updates and K8s event creation.

tests/helpers/kube.go (2)

2207-2249: LGTM - Well-implemented DaemonSet rollout helper.

The function follows the same pattern as WaitForDeploymentRollout, with appropriate status checks and logging. It correctly validates that all pods are scheduled, updated, and ready.


2251-2336: LGTM - Comprehensive argument handling.

The SetDeploymentArgs and setArgsOnContainer functions properly handle multiple argument formats (--flag=value, --flag value, and boolean flags). The retry-on-conflict pattern is correctly applied.
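
The argument handling can be pictured with a small standalone sketch. This is an assumption-laden illustration, not the helper from tests/helpers/kube.go: the function name setArg is hypothetical, it operates on a plain string slice rather than a container spec, and it treats any token not starting with "-" after a bare flag as that flag's value.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg returns a copy of args in which flag is set to value.
// It replaces an existing "--flag=value" or "--flag value" occurrence and
// appends the flag when it is absent. An empty value means a boolean flag.
func setArg(args []string, flag, value string) []string {
	desired := flag
	if value != "" {
		desired = flag + "=" + value
	}

	out := make([]string, 0, len(args)+1)
	found := false
	for i := 0; i < len(args); i++ {
		arg := args[i]
		switch {
		case strings.HasPrefix(arg, flag+"="):
			// "--flag=value" form: replaced below.
		case arg == flag:
			// "--flag value" form: also skip the old value token.
			if value != "" && i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				i++
			}
		default:
			out = append(out, arg)
			continue
		}
		if !found {
			out = append(out, desired)
			found = true
		}
	}
	if !found {
		out = append(out, desired)
	}
	return out
}

func main() {
	args := []string{"--port=8080", "--processing-strategy", "EXECUTE_REMEDIATION", "--verbose"}
	fmt.Println(setArg(args, "--processing-strategy", "STORE_ONLY"))
}
```

Normalizing to the "--flag=value" form on write keeps later removal simple, since only one token has to be matched.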

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM!

The addition of processingStrategy follows the same pattern as recommendedAction (line 61), using the protobuf enum's .String() method for consistent serialization in the CloudEvent payload.

distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml (1)

161-161: LGTM!

The new --processing-strategy argument correctly references the Helm value defined in values.yaml, maintaining consistency with other configuration options passed to the maintenance-notifier container.

Consider whether invalid values should fail at Helm rendering time rather than runtime. You could add a validation check in the template:

{{- if not (or (eq .Values.processingStrategy "EXECUTE_REMEDIATION") (eq .Values.processingStrategy "STORE_ONLY")) }}
{{- fail "processingStrategy must be either EXECUTE_REMEDIATION or STORE_ONLY" }}
{{- end }}

This would catch typos during helm install/upgrade rather than at container startup.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM - Correct pipeline selection for fault-quarantine.

Switching to BuildProcessableHealthEventInsertsPipeline() ensures the fault-quarantine module only processes health events with EXECUTE_REMEDIATION strategy, filtering out STORE_ONLY events at the data layer. This aligns with the PR objective of allowing observability-only event handling.

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM!

The test follows the established table-driven pattern and validates the new pipeline method across both MongoDB and PostgreSQL builders. The assertions are appropriate for verifying pipeline structure.

tests/event_exporter_test.go (1)

85-85: LGTM!

The test correctly validates the new processingStrategy field with the expected value "EXECUTE_REMEDIATION".

tests/helpers/event_exporter.go (2)

220-254: LGTM!

The new helper function follows the established pattern of FindEventByNodeAndMessage and provides a useful search capability for CloudEvents by node name, check name, and health status.


256-283: LGTM!

The extended validation correctly asserts the new processingStrategy field in CloudEvents, maintaining consistency with the protobuf schema changes.

store-client/pkg/client/postgresql_pipeline_builder.go (1)

119-132: LGTM!

The pipeline correctly filters for INSERT operations with processingStrategy=EXECUTE_REMEDIATION. The implementation aligns with the existing pipeline patterns in the file.

event-exporter/pkg/transformer/cloudevents_test.go (1)

69-69: LGTM!

The test correctly validates that the ProcessingStrategy field is properly serialized to CloudEvents. The use of t.Errorf for assertions aligns with the coding guidelines for simple equality checks.

Also applies to: 106-108

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM!

The new interface method is well-documented with clear usage guidance. The godoc comment effectively explains the filtering behavior for processing-strategy aware pipelines.

data-models/protobufs/health_event.proto (2)

32-38: LGTM!

The ProcessingStrategy enum is well-documented with clear semantic definitions. Using EXECUTE_REMEDIATION = 0 as the default value ensures backward compatibility: existing events without this field will automatically have execution behavior, which is the expected default.


77-77: LGTM!

The new field follows protobuf best practices with sequential field numbering and uses the appropriate enum type.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (1)

193-193: LGTM!

All test cases correctly updated to pass pb.ProcessingStrategy_EXECUTE_REMEDIATION to the NewEngine constructor, maintaining consistency with the new signature.

Also applies to: 207-207, 612-612, 793-793, 839-839

health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)

61-61: LGTM! CLI flag and configuration integration looks correct.

The new --processing-strategy flag with default EXECUTE_REMEDIATION follows the existing flag patterns in this file. The help text clearly documents the valid options.

Also applies to: 75-76


219-227: Validation logic is correct and idiomatic.

Using pb.ProcessingStrategy_value map for validation ensures only valid protobuf enum values are accepted. The error message is clear about expected values. The cast to pb.ProcessingStrategy(value) is safe since value was validated.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)

1391-1506: Comprehensive test coverage for STORE_ONLY processing strategy.

The test cases thoroughly cover the expected behavior:

  • STORE_ONLY events should not create node conditions or K8s events
  • EXECUTE_REMEDIATION events should create appropriate conditions
  • Mixed batches correctly process only EXECUTE_REMEDIATION events

The table-driven approach with descriptive names follows Go testing best practices.


1550-1562: Good approach to isolating NVSentinel-specific conditions.

The filtering logic correctly excludes all standard Kubernetes node condition types before counting. This ensures the test validates only conditions created by the health event processing.
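
A minimal sketch of that isolation step, assuming a simplified NodeCondition struct instead of the client-go corev1 type. The five built-in condition type names are the standard Kubernetes ones; the helper name customConditions and the sample condition "GpuHealthCheck" are hypothetical.

```go
package main

import "fmt"

// NodeCondition is a simplified stand-in for corev1.NodeCondition.
type NodeCondition struct {
	Type   string
	Status string
}

// standardConditionTypes lists the built-in Kubernetes node condition types.
var standardConditionTypes = map[string]bool{
	"Ready":              true,
	"MemoryPressure":     true,
	"DiskPressure":       true,
	"PIDPressure":        true,
	"NetworkUnavailable": true,
}

// customConditions returns the conditions that are not built-in node
// conditions, i.e. the ones a health-event processor could have added.
func customConditions(conds []NodeCondition) []NodeCondition {
	var custom []NodeCondition
	for _, c := range conds {
		if standardConditionTypes[c.Type] {
			continue
		}
		custom = append(custom, c)
	}
	return custom
}

func main() {
	conds := []NodeCondition{
		{Type: "Ready", Status: "True"},
		{Type: "GpuHealthCheck", Status: "True"}, // hypothetical custom condition
	}
	fmt.Println(len(customConditions(conds)))
}
```

Counting only the remaining conditions lets a STORE_ONLY test assert "zero new conditions" without being affected by what the kubelet reports.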

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Auto-generated protobuf file - no manual review needed.

This file is generated by the protocol buffer compiler (as noted in line 2). The changes reflect the addition of the ProcessingStrategy enum to the proto schema. Ensure this was regenerated from the updated .proto source rather than manually edited.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)

58-67: Engine struct correctly extended with processingStrategy field.

The new field follows the existing struct field patterns. This design means the processing strategy is configured at the engine level (via CLI flag) and applies uniformly to all emitted health events.


69-86: Constructor updated correctly with new parameter.

The processingStrategy parameter is properly added and stored. This is a breaking API change - callers must now provide the strategy. Per the AI summary, tests have been updated accordingly.


348-366: ProcessingStrategy correctly propagated to health events.

The ProcessingStrategy field is populated from the engine's configured value, ensuring all health events from this engine instance carry the same strategy. This aligns with the CLI-based configuration approach.
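
The engine-level propagation can be sketched as follows. This is a simplified illustration with local stand-in types: the real NewEngine takes additional parameters that are not shown in this review, and the method name newHealthEvent is hypothetical.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int

const (
	ExecuteRemediation ProcessingStrategy = iota
	StoreOnly
)

// String renders the strategy the way the enum names appear in the review.
func (s ProcessingStrategy) String() string {
	if s == StoreOnly {
		return "STORE_ONLY"
	}
	return "EXECUTE_REMEDIATION"
}

// HealthEvent is a trimmed, hypothetical stand-in for pb.HealthEvent.
type HealthEvent struct {
	NodeName           string
	ProcessingStrategy ProcessingStrategy
}

// Engine holds the strategy chosen once at startup.
type Engine struct {
	processingStrategy ProcessingStrategy
}

// NewEngine stores the configured strategy; other dependencies are omitted.
func NewEngine(strategy ProcessingStrategy) *Engine {
	return &Engine{processingStrategy: strategy}
}

// newHealthEvent stamps every event with the engine's configured strategy.
func (e *Engine) newHealthEvent(nodeName string) *HealthEvent {
	return &HealthEvent{
		NodeName:           nodeName,
		ProcessingStrategy: e.processingStrategy,
	}
}

func main() {
	engine := NewEngine(StoreOnly)
	ev := engine.newHealthEvent("node-a")
	fmt.Println(ev.NodeName, ev.ProcessingStrategy)
}
```

Because the strategy lives on the engine, changing it requires restarting the process with a different flag value, which matches the rollout-based approach the integration tests use.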

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-18: Auto-generated Python stub file correctly reflects ProcessingStrategy additions.

The .pyi stub file properly exposes:

  • ProcessingStrategy enum class with EXECUTE_REMEDIATION and STORE_ONLY values
  • Module-level enum constants
  • processingStrategy field in HealthEvent with correct typing

This is generated code that mirrors the proto schema changes.

Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-csp-monitor branch from 61f34e7 to d314731 Compare January 12, 2026 11:39

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)

348-369: Critical bug: ProcessingStrategy is set twice, and the hardcoded value conflicts with the configured one.

The ProcessingStrategy field is assigned twice in this struct literal:

  1. Line 354: ProcessingStrategy: e.processingStrategy (correct, uses configured value)
  2. Line 368: ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION (hardcoded)

Go does not allow the same field name to appear twice in a keyed struct literal; the compiler rejects it with a "duplicate field name" error, so this file will not build as written. Resolving the conflict by keeping the hardcoded EXECUTE_REMEDIATION line would ignore the configured e.processingStrategy and make the entire feature non-functional. The TODO comment references this PR, so the hardcoded line and its TODO should be removed.

🔧 Proposed fix - remove duplicate assignment and stale TODO
 	healthEvent := &pb.HealthEvent{
 		Agent:              "csp-health-monitor", // Consistent agent name
 		ComponentClass:     event.ResourceType,   // e.g., "EC2", "gce_instance"
 		CheckName:          "CSPMaintenance",     // Consistent check name
 		IsFatal:            isFatal,
 		IsHealthy:          isHealthy,
 		ProcessingStrategy: e.processingStrategy,
 		Message:            message,
 		RecommendedAction:  pb.RecommendedAction(actionEnum),
 		EntitiesImpacted: []*pb.Entity{
 			{
 				EntityType:  event.ResourceType,
 				EntityValue: event.ResourceID, // CSP's ID (e.g., instance-id, full gcp resource name)
 			},
 		},
 		Metadata:           event.Metadata, // Pass along metadata
 		NodeName:           event.NodeName, // K8s node name
 		GeneratedTimestamp: timestamppb.New(time.Now()),
-		// TODO: Remove hardcoded processing strategy and make it configurable via the config file.
-		// PR: https://github.com/NVIDIA/NVSentinel/pull/641
-		ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION,
 	}
🤖 Fix all issues with AI agents
In @tests/helpers/kube.go:
- Around line 2362-2364: The function docstring for RemoveDeploymentArgs
incorrectly says it "removes environment variables"; update the comment to
accurately describe that RemoveDeploymentArgs removes container command-line
arguments (args) from a deployment, specifying that if containerName is empty it
removes args from all containers, otherwise only from the named container, and
keep the note about using retry.RetryOnConflict for automatic retry handling;
ensure the summary line and any parameter descriptions reflect "arguments/args"
rather than "environment variables."
🧹 Nitpick comments (1)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (1)

219-227: Consider rejecting UNSPECIFIED as a valid processing strategy.

The validation uses pb.ProcessingStrategy_value which includes UNSPECIFIED (value 0), but the help text and values.yaml documentation only mention EXECUTE_REMEDIATION and STORE_ONLY as valid options. Allowing UNSPECIFIED could lead to undefined behavior downstream.

💡 Proposed fix to explicitly reject UNSPECIFIED
 		value, ok := pb.ProcessingStrategy_value[appCfg.processingStrategy]
-		if !ok {
+		if !ok || pb.ProcessingStrategy(value) == pb.ProcessingStrategy_UNSPECIFIED {
 			return fmt.Errorf("invalid processingStrategy %q (expected EXECUTE_REMEDIATION or STORE_ONLY)",
 				appCfg.processingStrategy)
 		}
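The same check can be tried in isolation. This is a sketch under stated assumptions: strategyByName is a hypothetical stand-in for the generated pb.ProcessingStrategy_value map, parseStrategy is an illustrative helper rather than a function in the PR, and the value 2 for STORE_ONLY is assumed.

```go
package main

import "fmt"

// ProcessingStrategy is a hypothetical stand-in for pb.ProcessingStrategy.
type ProcessingStrategy int32

const (
	StrategyUnspecified        ProcessingStrategy = 0
	StrategyExecuteRemediation ProcessingStrategy = 1
	StrategyStoreOnly          ProcessingStrategy = 2 // assumed value
)

// strategyByName mimics the shape of a generated name-to-value map,
// which also contains the zero-valued UNSPECIFIED entry.
var strategyByName = map[string]int32{
	"UNSPECIFIED":         0,
	"EXECUTE_REMEDIATION": 1,
	"STORE_ONLY":          2,
}

// parseStrategy accepts only the two documented strategies. A plain map
// lookup would also accept "UNSPECIFIED", so it is rejected explicitly.
func parseStrategy(name string) (ProcessingStrategy, error) {
	value, found := strategyByName[name]
	if !found || ProcessingStrategy(value) == StrategyUnspecified {
		return StrategyUnspecified, fmt.Errorf(
			"invalid processingStrategy %q (expected EXECUTE_REMEDIATION or STORE_ONLY)", name)
	}

	return ProcessingStrategy(value), nil
}

func main() {
	for _, name := range []string{"STORE_ONLY", "UNSPECIFIED", "bogus"} {
		strategy, err := parseStrategy(name)
		fmt.Println(name, strategy, err)
	}
}
```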
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f34e7 and d314731.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (9)
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
  • health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/csp_health_monitor_test.go
  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/helpers/event_exporter.go
  • tests/csp_health_monitor_test.go
  • tests/event_exporter_test.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/helpers/kube.go
  • health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
🧠 Learnings (5)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (60-64)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
  • NewEngine (70-86)
🔇 Additional comments (7)
tests/helpers/kube.go (2)

2232-2273: LGTM!

The WaitForDaemonSetRollout function correctly checks all the relevant DaemonSet status conditions (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) and follows the established patterns in this file with proper logging and require.Eventually usage.


2275-2360: LGTM!

The SetDeploymentArgs function correctly follows the established patterns from SetDeploymentEnvVars, with proper retry-on-conflict handling and container targeting. The setArgsOnContainer helper handles multiple argument formats (--flag=value, --flag value, boolean flags) appropriately.
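The three argument formats mentioned here can be illustrated with a small standalone sketch. setArg and formatArg are hypothetical names and this is not the helper's actual implementation; it assumes that an empty value means a boolean flag and that a token following a bare flag is that flag's value when it does not start with "-".

```go
package main

import (
	"fmt"
	"strings"
)

// formatArg renders a flag as "--flag=value", or as a bare "--flag"
// when the value is empty (boolean flag).
func formatArg(flag, value string) string {
	if value == "" {
		return flag
	}

	return flag + "=" + value
}

// setArg updates flag in args if present and appends it otherwise.
// It mutates and returns args, and stops after the first match.
func setArg(args []string, flag, value string) []string {
	for i, arg := range args {
		switch {
		case strings.HasPrefix(arg, flag+"="):
			// --flag=old becomes --flag=new.
			args[i] = formatArg(flag, value)
			return args
		case arg == flag:
			if value == "" {
				return args // boolean flag already present
			}

			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				args[i+1] = value // "--flag old" becomes "--flag new"
				return args
			}

			// Bare flag with no value token after it: rewrite in place.
			args[i] = formatArg(flag, value)

			return args
		}
	}

	return append(args, formatArg(flag, value))
}

func main() {
	args := []string{"--metrics-port=2112", "--processing-strategy", "EXECUTE_REMEDIATION"}
	fmt.Println(setArg(args, "--processing-strategy", "STORE_ONLY"))
}
```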

distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml (1)

161-161: LGTM!

The new --processing-strategy argument correctly references .Values.processingStrategy and follows the established pattern of other container arguments in this deployment template.

distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml (1)

52-57: LGTM!

The new processingStrategy configuration is well documented with clear descriptions of both modes. The default EXECUTE_REMEDIATION maintains backward compatibility with existing deployments. As per coding guidelines, the inline comments adequately document the configuration options.

health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (1)

61-61: LGTM!

The new processingStrategy field and CLI flag are correctly added to the application configuration with an appropriate default value and descriptive help text.

Also applies to: 75-76

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)

58-67: LGTM!

The Engine struct correctly includes the new processingStrategy field with appropriate type pb.ProcessingStrategy.


69-86: LGTM!

The NewEngine constructor correctly accepts the new processingStrategy parameter and initializes it on the Engine struct.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/helpers/kube.go (1)

2362-2390: Missing container-not-found check for consistency.

Unlike SetDeploymentArgs (line 2306-2308) and RemoveDeploymentEnvVars (line 1259-1260), this function doesn't return an error when the specified containerName is not found. This inconsistency could mask configuration errors in tests.

♻️ Suggested fix for consistency
 func RemoveDeploymentArgs(
 	ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string,
 ) error {
 	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		deployment := &appsv1.Deployment{}
 		if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil {
 			return err
 		}

 		if len(deployment.Spec.Template.Spec.Containers) == 0 {
 			return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
 		}

+		found := false
+
 		for i := range deployment.Spec.Template.Spec.Containers {
 			container := &deployment.Spec.Template.Spec.Containers[i]

 			if containerName != "" && container.Name != containerName {
 				continue
 			}

+			found = true
+
 			removeArgsFromContainer(container, args)
 		}

+		if containerName != "" && !found {
+			return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+		}
+
 		return c.Resources().Update(ctx, deployment)
 	})
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d314731 and 8296232.

📒 Files selected for processing (2)
  • tests/csp_health_monitor_test.go
  • tests/helpers/kube.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/csp_health_monitor_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)


Files:

  • tests/helpers/kube.go
🧠 Learnings (2)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/helpers/kube.go
🔇 Additional comments (4)
tests/helpers/kube.go (4)

2232-2273: LGTM!

The DaemonSet rollout wait logic correctly checks all required conditions (DesiredNumberScheduled > 0, UpdatedNumberScheduled, NumberReady) and follows the established patterns in this file. Good use of t.Helper() and informative logging.
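The rollout condition described here can be written as a single predicate. This is a sketch under stated assumptions: daemonSetStatus is a hypothetical subset of the appsv1.DaemonSetStatus fields named in the comment, rolloutComplete is an illustrative name, and the use of equality (rather than >=) is an assumption about how the helper compares the counters.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical subset of appsv1.DaemonSetStatus.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod has been updated
// and is ready. The > 0 guard avoids treating a DaemonSet with no
// scheduled pods as rolled out.
func rolloutComplete(status daemonSetStatus) bool {
	return status.DesiredNumberScheduled > 0 &&
		status.UpdatedNumberScheduled == status.DesiredNumberScheduled &&
		status.NumberReady == status.DesiredNumberScheduled
}

func main() {
	inProgress := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 3}
	done := daemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}
	fmt.Println(rolloutComplete(inProgress), rolloutComplete(done))
}
```

A predicate like this would typically be polled until it returns true or a timeout expires.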


2275-2312: LGTM!

The function follows the established pattern of SetDeploymentEnvVars and correctly handles retry-on-conflict without error wrapping inside the retry block, as per coding guidelines. Good documentation in the godoc comment.


2314-2360: LGTM, logic handles multiple argument formats correctly.

The function correctly handles the three common CLI argument styles (--flag=value, --flag, --flag value). The slice mutation during iteration is safe since you break immediately after any modification.


2392-2417: LGTM!

The removal logic correctly handles both --flag=value and --flag value styles. The slice bounds are safe since the j+1 < len(container.Args) check guards the two-element removal case.
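For comparison, the same two argument styles can be removed without in-place slice surgery. The sketch below is an alternative technique, not the helper under review: removeArg is a hypothetical name, it builds a new slice instead of deleting elements from the original, and it assumes a token after a bare flag is that flag's value when it does not start with "-".

```go
package main

import (
	"fmt"
	"strings"
)

// removeArg returns a copy of args without flag, covering both the
// "--flag=value" and "--flag value" styles.
func removeArg(args []string, flag string) []string {
	kept := make([]string, 0, len(args))

	for i := 0; i < len(args); i++ {
		arg := args[i]

		switch {
		case strings.HasPrefix(arg, flag+"="):
			continue // drop "--flag=value"
		case arg == flag:
			// Drop "--flag" and, when one follows, its separate value.
			// The bounds check keeps i+1 inside the slice.
			if i+1 < len(args) && !strings.HasPrefix(args[i+1], "-") {
				i++
			}

			continue
		}

		kept = append(kept, arg)
	}

	return kept
}

func main() {
	args := []string{"--metrics-port=2112", "--processing-strategy", "STORE_ONLY", "--verbose"}
	fmt.Println(removeArg(args, "--processing-strategy"))
}
```

Building a fresh slice sidesteps index bookkeeping during iteration, at the cost of one allocation per call.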

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-csp-monitor branch from 8296232 to ca69902 Compare January 12, 2026 15:50
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @tests/csp_health_monitor_test.go:
- Around line 500-513: In the require.Eventually closure, fix the variable
shadowing and return logic: assign the found event to the predeclared
receivedEvent (replace using the local event only) so ValidateCloudEvent
receives the populated map (use the map[string]any from
FindEventByNodeAndCheckName), and after successful helpers.ValidateCloudEvent
return true instead of false so the Eventually loop can stop; update the closure
around FindEventByNodeAndCheckName, receivedEvent, and the ValidateCloudEvent
call accordingly.
- Around line 518-531: Teardown incorrectly sets the csp-health-monitor
maintenance-notifier container arg "--processing-strategy" to "STORE_ONLY" (same
as setup) instead of restoring the default; update the Teardown block that calls
helpers.SetDeploymentArgs for service "csp-health-monitor" / container
"maintenance-notifier" to either remove the "--processing-strategy" arg or set
it to the default value "EXECUTE_REMEDIATION" (use whichever helper exists,
e.g., a RemoveDeploymentArg or SetDeploymentArgs with "--processing-strategy":
"EXECUTE_REMEDIATION") so subsequent tests are not affected.
- Around line 431-435: SetDeploymentArgs return value is ignored; update the
test to capture its error and fail the test on error. Call
helpers.SetDeploymentArgs(ctx, client, "csp-health-monitor",
helpers.NVSentinelNamespace, "maintenance-notifier",
map[string]string{"--processing-strategy":"STORE_ONLY"}) into an err variable,
check if err != nil, and call t.Fatalf("SetDeploymentArgs failed: %v", err)
before proceeding to helpers.WaitForDeploymentRollout so configuration
failures are reported immediately.
🧹 Nitpick comments (1)
tests/helpers/kube.go (1)

2362-2390: Consider adding container existence validation for consistency.

Unlike SetDeploymentArgs, RemoveDeploymentArgs doesn't validate that the specified container exists when containerName is provided. While idempotent removal is reasonable, this inconsistency could mask configuration errors.

♻️ Optional: Add container validation
+		found := false
+
 		for i := range deployment.Spec.Template.Spec.Containers {
 			container := &deployment.Spec.Template.Spec.Containers[i]
 
 			if containerName != "" && container.Name != containerName {
 				continue
 			}
 
+			found = true
+
 			removeArgsFromContainer(container, args)
 		}
 
+		if containerName != "" && !found {
+			return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+		}
+
 		return c.Resources().Update(ctx, deployment)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8296232 and ca69902.

📒 Files selected for processing (3)
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/csp_health_monitor_test.go
  • tests/helpers/kube.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)


Files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/helpers/kube.go
  • tests/csp_health_monitor_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/csp_health_monitor_test.go
🧠 Learnings (7)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/helpers/kube.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/csp_health_monitor_test.go
🧬 Code graph analysis (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)
health-monitors/csp-health-monitor/pkg/datastore/datastore.go (1)
  • Store (33-52)
data-models/pkg/protos/health_event_grpc.pb.go (1)
  • PlatformConnectorClient (43-45)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (45-45)
  • ProcessingStrategy (77-79)
  • ProcessingStrategy (81-83)
  • ProcessingStrategy (90-92)
tests/csp_health_monitor_test.go (8)
tests/helpers/csp_health_monitor.go (4)
  • CSPHealthMonitorTestContext (42-47)
  • SetupCSPHealthMonitorTest (77-81)
  • AddGCPInstanceIDAnnotation (256-277)
  • WaitForCSPHealthMonitorPoll (404-410)
health-monitors/csp-health-monitor/pkg/csp/aws/aws.go (1)
  • NewClient (115-198)
health-monitors/csp-health-monitor/pkg/csp/gcp/gcp.go (1)
  • NewClient (133-217)
tests/helpers/kube.go (7)
  • SetDeploymentArgs (2279-2312)
  • NVSentinelNamespace (64-64)
  • WaitForDeploymentRollout (984-1125)
  • GetNodeByName (466-475)
  • EnsureNodeConditionNotPresent (1797-1818)
  • EventuallyWaitTimeout (61-61)
  • WaitInterval (63-63)
tests/helpers/csp_api_mock.go (2)
  • CSPGCP (45-45)
  • CSPMaintenanceEvent (49-66)
data-models/pkg/model/maintenance_event.go (2)
  • CSP (43-43)
  • MaintenanceType (46-46)
tests/helpers/fault_quarantine.go (2)
  • AssertQuarantineState (317-384)
  • QuarantineAssertion (56-60)
tests/helpers/event_exporter.go (3)
  • GetMockEvents (36-99)
  • FindEventByNodeAndCheckName (221-254)
  • ValidateCloudEvent (257-283)
🔇 Additional comments (6)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)

58-67: LGTM!

The processingStrategy field is properly added to the Engine struct, allowing the strategy to be threaded through event processing.


70-86: LGTM!

The constructor properly accepts and initializes the processingStrategy parameter, maintaining consistency with the existing pattern for other configuration parameters.


348-366: LGTM!

The ProcessingStrategy is correctly propagated from the Engine to the emitted HealthEvent. Based on learnings, this correctly allows both healthy and unhealthy events to use either processing strategy as needed.

tests/helpers/kube.go (3)

2231-2273: LGTM!

The WaitForDaemonSetRollout helper follows the established pattern from WaitForDeploymentRollout, properly checking DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady status fields.


2275-2360: LGTM!

The SetDeploymentArgs helper correctly handles multiple argument formats (--flag=value, --flag value, and boolean --flag). The use of retry.RetryOnConflict follows the coding guidelines for retry behavior within retry blocks.


2392-2417: LGTM!

The removeArgsFromContainer helper correctly handles both --flag=value and --flag value argument styles. The slice operations are safe even at boundary conditions.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-csp-monitor branch from ca69902 to e3d61ff Compare January 12, 2026 15:53
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @tests/csp_health_monitor_test.go:
- Around line 500-513: The Eventually block is incorrect: it declares
receivedEvent but never assigns it and calls helpers.ValidateCloudEvent with
that uninitialized variable, and it returns false even after successful
validation. Replace usage of receivedEvent with the event returned by
helpers.FindEventByNodeAndCheckName (use the local variable event), and change
the final return value to true after helpers.ValidateCloudEvent succeeds so the
require.Eventually stops on success; keep the initial not-found case returning
false as-is.
- Around line 518-531: The Teardown currently forces the processing strategy to
"STORE_ONLY" which repeats the setup change; instead restore the
original/default strategy: capture the current deployment args during Setup
(e.g., read and save the processing strategy for
"csp-health-monitor"/"maintenance-notifier" in Setup) and in the Teardown
closure call helpers.SetDeploymentArgs with that saved value rather than
hardcoding "STORE_ONLY"; alternatively, if you cannot read the original at
Setup, set the strategy back to the expected default "EXECUTE_REMEDIATION" in
the Teardown to avoid impacting subsequent tests.

In @tests/helpers/kube.go:
- Around line 2362-2390: RemoveDeploymentArgs currently skips when a specific
containerName is provided but not present, causing silent no-ops; update
RemoveDeploymentArgs to track whether a matching container was found (e.g., a
bool flag while iterating deployment.Spec.Template.Spec.Containers), and after
the loop if containerName != "" and no match was found return an error like
fmt.Errorf("container %s not found in deployment %s/%s", containerName,
namespace, deploymentName). Ensure this check mirrors the behavior in
SetDeploymentArgs and RemoveDeploymentEnvVars and keep calls to
removeArgsFromContainer and c.Resources().Update unchanged.
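The Eventually fix requested above for tests/csp_health_monitor_test.go comes down to two rules: the closure must pass along the event it actually found, and it must return true once validation succeeds. A minimal stdlib sketch of that pattern follows; eventually is a hand-written stand-in for testify's require.Eventually, and findEvent and waitForEvent are hypothetical helpers, not functions from the test suite.

```go
package main

import (
	"fmt"
	"time"
)

// eventually polls condition until it returns true or timeout elapses.
func eventually(timeout, interval time.Duration, condition func() bool) bool {
	deadline := time.Now().Add(timeout)

	for {
		if condition() {
			return true
		}

		if time.Now().After(deadline) {
			return false
		}

		time.Sleep(interval)
	}
}

// findEvent is a hypothetical lookup by node name.
func findEvent(events []map[string]any, node string) (map[string]any, bool) {
	for _, event := range events {
		if event["nodeName"] == node {
			return event, true
		}
	}

	return nil, false
}

// waitForEvent shows the corrected closure: the found event is assigned
// to the outer variable with "=" (no ":=" shadowing), and the closure
// returns true on success so polling stops.
func waitForEvent(events []map[string]any, node string) (map[string]any, bool) {
	var receivedEvent map[string]any

	ok := eventually(200*time.Millisecond, 5*time.Millisecond, func() bool {
		event, found := findEvent(events, node)
		if !found {
			return false // keep polling
		}

		receivedEvent = event

		return true
	})

	return receivedEvent, ok
}

func main() {
	events := []map[string]any{{"nodeName": "node-a", "processingStrategy": "STORE_ONLY"}}
	event, ok := waitForEvent(events, "node-a")
	fmt.Println(ok, event["processingStrategy"])
}
```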
🧹 Nitpick comments (1)
tests/csp_health_monitor_test.go (1)

424-425: Unused variable injectedEventID.

The variable is assigned on line 480 but never used afterward. If you don't need to track the event ID for status updates (unlike other tests), consider using _ for the first return value.

♻️ Optional cleanup
-	var injectedEventID string
 	var testInstanceID string

And on line 480:

-		injectedEventID, _, err = testCtx.CSPClient.InjectEvent(event)
+		_, _, err = testCtx.CSPClient.InjectEvent(event)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca69902 and e3d61ff.

📒 Files selected for processing (3)
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/csp_health_monitor_test.go
  • tests/helpers/kube.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)


Files:

  • tests/csp_health_monitor_test.go
  • tests/helpers/kube.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)


Files:

  • tests/csp_health_monitor_test.go
🧠 Learnings (6)
📓 Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/csp_health_monitor_test.go
📚 Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
data-models/pkg/protos/health_event.pb.go (8)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
  • RecommendedAction (93-93)
  • RecommendedAction (143-145)
  • RecommendedAction (147-149)
  • RecommendedAction (156-158)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
  • ProcessingStrategy (14-18)
  • RecommendedAction (20-30)
🔇 Additional comments (8)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)

58-67: LGTM!

The processingStrategy field addition follows the existing struct pattern with proper typing using the protobuf enum type.


348-366: LGTM!

The ProcessingStrategy field is correctly assigned from the engine's configuration. This approach ensures consistent processing strategy across all health events emitted by this engine, which aligns with the design where healthy events can legitimately use EXECUTE_REMEDIATION when the Fault Quarantine Manager needs to act on them. Based on learnings, this is valid behavior for the NVSentinel system.


69-86: No action needed: validation occurs upstream.

Upstream validation in health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (lines 219-223) validates the processingStrategy string flag against valid enum values before calling NewEngine. The flag default is EXECUTE_REMEDIATION, and only validated values (EXECUTE_REMEDIATION or STORE_ONLY) are passed to the constructor. The constructor correctly follows the pattern where the caller guarantees valid input.

Likely an incorrect or invalid review comment.
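
For readers following the 69-86 comment, here is a minimal sketch of what validating the flag before constructing the engine could look like. It is not the code in main.go: the `ProcessingStrategy` type, its numeric values, and `parseProcessingStrategy` are hypothetical stand-ins for the generated protobuf enum and whatever check the PR actually uses. Only the flag name `--processing-strategy` and the two accepted values come from this review.

```go
package main

import (
	"flag"
	"fmt"
)

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
// The numeric values are placeholders chosen for this sketch.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2
)

// strategyByName plays the role of the name-to-value lookup used for validation.
var strategyByName = map[string]ProcessingStrategy{
	"EXECUTE_REMEDIATION": ExecuteRemediation,
	"STORE_ONLY":          StoreOnly,
}

// parseProcessingStrategy rejects any name outside the known set, so callers
// further down can assume they always receive a valid strategy.
func parseProcessingStrategy(name string) (ProcessingStrategy, error) {
	strategy, found := strategyByName[name]
	if !found {
		return 0, fmt.Errorf("invalid processing strategy %q", name)
	}
	return strategy, nil
}

func main() {
	fs := flag.NewFlagSet("maintenance-notifier", flag.ContinueOnError)
	name := fs.String("processing-strategy", "EXECUTE_REMEDIATION",
		"EXECUTE_REMEDIATION or STORE_ONLY")

	// A fixed argument list keeps the sketch deterministic.
	if err := fs.Parse([]string{"--processing-strategy=STORE_ONLY"}); err != nil {
		fmt.Println("flag parsing failed:", err)
		return
	}

	strategy, err := parseProcessingStrategy(*name)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("validated strategy value:", strategy)
}
```

With validation done at the call site, the constructor can take the enum value as-is, which is the pattern the comment describes.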

tests/csp_health_monitor_test.go (1)

427-458: LGTM on setup logic.

The setup correctly configures the STORE_ONLY processing strategy via deployment args before running the test, and waits for the deployment rollout. The flow is appropriate since SetDeploymentArgs triggers a rollout that effectively restarts the pods.

tests/helpers/kube.go (4)

2232-2273: LGTM!

The WaitForDaemonSetRollout function correctly validates DaemonSet rollout completion by checking all three status conditions (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady). The implementation follows the established patterns in this file, uses appropriate test helpers (t.Helper()), and provides good logging for debugging test failures.
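
As an illustration of the three-field check described above, the sketch below polls a status until the rollout looks complete. It is a simplified approximation, not the helper in tests/helpers/kube.go: `daemonSetStatus`, `rolloutComplete`, and `waitForRollout` are hypothetical names, and the struct only mirrors the three status fields named in this comment so that the example runs without a cluster.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// daemonSetStatus mirrors the three DaemonSet status fields named in the review.
type daemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod is both updated and ready.
func rolloutComplete(s daemonSetStatus) bool {
	return s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

// waitForRollout polls fetch at the given interval until the rollout is
// complete, fetch fails, or the context is done.
func waitForRollout(ctx context.Context, interval time.Duration,
	fetch func() (daemonSetStatus, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		status, err := fetch()
		if err != nil {
			return fmt.Errorf("fetching daemonset status: %w", err)
		}
		if rolloutComplete(status) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for daemonset rollout: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	calls := 0
	// Fake fetcher: one more pod becomes updated and ready on each poll.
	fetch := func() (daemonSetStatus, error) {
		calls++
		ready := int32(calls)
		if ready > 3 {
			ready = 3
		}
		return daemonSetStatus{
			DesiredNumberScheduled: 3,
			UpdatedNumberScheduled: ready,
			NumberReady:            ready,
		}, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := waitForRollout(ctx, 10*time.Millisecond, fetch); err != nil {
		fmt.Println("rollout failed:", err)
		return
	}
	fmt.Println("rollout complete after", calls, "polls")
}
```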


2275-2312: LGTM!

The function follows the same well-established pattern as SetDeploymentEnvVars (lines 1167-1200) with proper retry-on-conflict semantics. Error handling correctly returns unwrapped errors from API calls to preserve retry behavior while wrapping non-retryable errors with context. As per coding guidelines, errors within retry.RetryOnConflict blocks are handled correctly.


2314-2360: LGTM!

The setArgsOnContainer helper correctly handles the three common CLI argument formats (--flag=value, --flag value, --flag boolean). The slice manipulation at line 2343 is correct, and the break after each modification prevents index corruption issues.

Minor note: The heuristic at line 2338 (!strings.HasPrefix(container.Args[j+1], "-")) to distinguish values from flags could theoretically misidentify negative numbers, but this is a reasonable trade-off for typical Kubernetes container arguments and matches common patterns in CLI parsing.
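
To make the three argument styles concrete, here is a small sketch of the same idea on a plain string slice. `setArg` is a hypothetical function written for this note, not the PR's setArgsOnContainer. Treating an empty value as "boolean flag" is an assumption of the sketch, and it shares the limitation mentioned above: a separate value that starts with "-" (such as a negative number) is read as another flag.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg returns a copy of args in which flag carries value. It updates the
// first existing occurrence and appends the flag if it is absent.
// An empty value means "boolean flag" (assumption made for this sketch).
func setArg(args []string, flag, value string) []string {
	out := append([]string(nil), args...) // copy so the caller's slice is untouched

	joined := flag
	if value != "" {
		joined = flag + "=" + value
	}

	for i, arg := range out {
		switch {
		case strings.HasPrefix(arg, flag+"="):
			// "--flag=value" style: rewrite in place.
			out[i] = joined
			return out
		case arg == flag:
			// Heuristic from the review: a following token that does not start
			// with "-" is taken to be this flag's value.
			hasSeparateValue := i+1 < len(out) && !strings.HasPrefix(out[i+1], "-")
			if hasSeparateValue && value != "" {
				out[i+1] = value // keep the "--flag value" style
				return out
			}
			if hasSeparateValue {
				// Flag becomes boolean: drop the stale value.
				return append(out[:i+1], out[i+2:]...)
			}
			out[i] = joined // bare "--flag": keep it, or give it a value
			return out
		}
	}
	return append(out, joined)
}

func main() {
	const flag = "--processing-strategy"
	fmt.Println(setArg([]string{"--processing-strategy=EXECUTE_REMEDIATION"}, flag, "STORE_ONLY"))
	fmt.Println(setArg([]string{flag, "EXECUTE_REMEDIATION", "--debug"}, flag, "STORE_ONLY"))
	fmt.Println(setArg([]string{"--debug"}, flag, "STORE_ONLY"))
}
```

Returning as soon as one occurrence is handled serves the same purpose as the break in the reviewed helper: the slice is never indexed again after it has been modified.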


2392-2417: LGTM!

The slice manipulation is correct and bounds-safe. The break after each removal prevents index corruption from the shrinking slice. The function correctly handles all three argument formats and the bounds check at line 2406 (j+1 < len(container.Args)) properly guards against out-of-bounds access.

Note: The function removes only the first occurrence of each flag, which is appropriate behavior for typical container argument configurations.
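
The removal side can be sketched the same way. `removeArg` below is a hypothetical function for illustration, not the PR's removeArgsFromContainer. It removes only the first occurrence of the flag, and it drops a following token as the flag's value only when that token exists and does not start with "-", matching the bounds check and heuristic described above.

```go
package main

import (
	"fmt"
	"strings"
)

// removeArg returns a copy of args without the first occurrence of flag.
// For the "--flag value" style the separate value is removed as well.
func removeArg(args []string, flag string) []string {
	for i, arg := range args {
		if strings.HasPrefix(arg, flag+"=") {
			// "--flag=value" style: a single element to drop.
			return append(append([]string(nil), args[:i]...), args[i+1:]...)
		}
		if arg == flag {
			end := i + 1
			// Bounds check first, then the "looks like a value" heuristic.
			if end < len(args) && !strings.HasPrefix(args[end], "-") {
				end++ // also drop the separate value
			}
			return append(append([]string(nil), args[:i]...), args[end:]...)
		}
	}
	// Flag not present: return an unmodified copy.
	return append([]string(nil), args...)
}

func main() {
	const flag = "--processing-strategy"
	fmt.Println(removeArg([]string{"--debug", flag, "STORE_ONLY", "--port=8080"}, flag))
	fmt.Println(removeArg([]string{"--processing-strategy=STORE_ONLY", "--debug"}, flag))
	fmt.Println(removeArg([]string{"--verbose", "--debug"}, "--verbose"))
}
```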

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-csp-monitor branch 2 times, most recently from 75309ae to 0fdaae2 Compare January 13, 2026 04:45
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/csp_health_monitor_test.go:
- Around line 504-517: Teardown is resetting the deployment to STORE_ONLY,
ignoring the SetDeploymentArgs error, and discarding the context return from
TeardownCSPHealthMonitorTest; change the Teardown body to call
helpers.SetDeploymentArgs with "--processing-strategy" set back to
"EXECUTE_REMEDIATION" and check its error with require.NoError(t, err) like in
setup, and capture and return the context value from
helpers.TeardownCSPHealthMonitorTest (i.e., ctx =
helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx); return ctx) so the
original processing strategy and context are properly restored; keep existing
calls to helpers.WaitForDeploymentRollout and require.NoError for c.NewClient().
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3d61ff and 0fdaae2.

📒 Files selected for processing (3)
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/csp_health_monitor_test.go
  • tests/helpers/kube.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/csp_health_monitor_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/helpers/kube.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/csp_health_monitor_test.go
🧠 Learnings (6)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/csp_health_monitor_test.go
πŸ“š Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
πŸ“š Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
πŸ“š Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (4)
health-monitors/csp-health-monitor/pkg/datastore/datastore.go (1)
  • Store (33-52)
data-models/pkg/protos/health_event_grpc.pb.go (1)
  • PlatformConnectorClient (43-45)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (44-44)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (80-82)
  • ProcessingStrategy (89-91)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: ko-build-test (event-exporter, .)
  • GitHub Check: ko-build-test (health-monitors/kubernetes-object-monitor, .)
  • GitHub Check: ko-build-test (node-drainer, .)
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
  • GitHub Check: ko-build-test (fault-remediation, .)
  • GitHub Check: ko-build-test (labeler, .)
  • GitHub Check: ko-build-test (janitor, .)
  • GitHub Check: ko-build-test (fault-quarantine, .)
  • GitHub Check: ko-build-test (platform-connectors, .)
  • GitHub Check: ko-build-test (health-events-analyzer, .)
  • GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
  • GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
  • GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: CodeQL PR Analysis
🔇 Additional comments (8)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)

58-86: LGTM!

The processingStrategy field is correctly added to the Engine struct, properly initialized via the constructor, and appropriately used in the event mapping function. This cleanly threads the configurable strategy through the trigger engine lifecycle.
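
A stripped-down sketch of the threading described above, for orientation only. Every type here is a local stand-in: the real Engine and NewEngine in trigger.go take additional dependencies, the real enum and HealthEvent come from the generated protobuf package, and `mapToHealthEvent` is a hypothetical name for the mapping step.

```go
package main

import "fmt"

// ProcessingStrategy stands in for the protobuf enum; values are placeholders.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 1
	StoreOnly          ProcessingStrategy = 2
)

// HealthEvent is reduced to the fields this sketch needs.
type HealthEvent struct {
	NodeName           string
	ProcessingStrategy ProcessingStrategy
}

// Engine holds the strategy it was configured with at construction time.
type Engine struct {
	processingStrategy ProcessingStrategy
}

// NewEngine stores an already validated strategy; it performs no validation
// itself, relying on the caller as the review notes.
func NewEngine(strategy ProcessingStrategy) *Engine {
	return &Engine{processingStrategy: strategy}
}

// mapToHealthEvent stamps every emitted event with the engine's strategy
// instead of a hardcoded default.
func (e *Engine) mapToHealthEvent(nodeName string) HealthEvent {
	return HealthEvent{
		NodeName:           nodeName,
		ProcessingStrategy: e.processingStrategy,
	}
}

func main() {
	engine := NewEngine(StoreOnly)
	fmt.Printf("%+v\n", engine.mapToHealthEvent("node-a"))
}
```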


348-366: LGTM!

The ProcessingStrategy is correctly set on the HealthEvent from the engine's configured strategy, replacing the previous hardcoded default. This aligns with the PR's objective to make the processing strategy configurable.

tests/helpers/kube.go (4)

2232-2273: LGTM!

The WaitForDaemonSetRollout helper correctly checks all relevant DaemonSet status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) and follows the established polling pattern used by other helpers in this file.


2275-2312: LGTM!

The SetDeploymentArgs function correctly uses retry.RetryOnConflict for automatic retry handling as per coding guidelines, validates container existence, and properly delegates to the internal helper.


2314-2360: LGTM!

The setArgsOnContainer helper comprehensively handles the three common argument styles (--flag=value, --flag value, and boolean --flag). The logic correctly updates existing args or appends new ones as needed.


2362-2425: LGTM!

The RemoveDeploymentArgs and removeArgsFromContainer functions follow the same patterns established by the Set* counterparts and correctly handle removal of all argument styles, including properly removing both the flag and its value when using --flag value style.

tests/csp_health_monitor_test.go (2)

417-459: LGTM on test setup.

The setup correctly configures the STORE_ONLY processing strategy and properly initializes the test environment with GCP annotations.


461-502: LGTM on test assertion.

The assess step correctly verifies that with STORE_ONLY strategy, the node is not cordoned and the CSPMaintenance condition is not applied.

Signed-off-by: Tanisha goyal <[email protected]>
@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-csp-monitor branch from 0fdaae2 to 400aab2 Compare January 13, 2026 05:10